# Image-text interaction

SmolVLM Instruct GGUF

SmolVLM is a compact open-source multimodal model that accepts image and text inputs and generates text outputs. It is designed for efficiency and suits on-device applications.

Tags: Transformers, English

Gemma 3 12B IT QAT Compressed Tensors

Gemma 3 is Google's family of lightweight, state-of-the-art open models, built on the same research and technology used to create the Gemini models. This model is multimodal: it processes text and image inputs and generates text outputs.

Qwen2.5 VL 32B Instruct GGUF

Qwen2.5-VL-32B-Instruct is a 32B-parameter vision-language model that jointly understands images and text and generates text outputs.

Tags: Image-to-Text, English

Gemma 3 27B IT QAT Q4_0 GGUF

Gemma is a family of lightweight open-source multimodal models from Google. It accepts text and image inputs and generates text outputs, has a large 128K-token context window, and supports over 140 languages.

Qwen2 VL 2B Instruct

Qwen2-VL-2B-Instruct is a multimodal vision-language model that supports image-text-to-text tasks.

Tags: Transformers, English

© 2025 AIbase